Assignment 3: Philadelphia Housing Model—Technical Appendix

Predicting Residential Property Prices

Author

Ming Cao, Mark Deng, Jun Luu, Josh Rigsby, Alex Stauffer, Tess Vu

Published

October 27, 2025

Data Cleaning

Primary Philadelphia Sales Data (OpenDataPhilly)

# Import relevant libraries.
library(car)
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(scales)
library(sf)
library(stringr)
library(tibble)
library(tidycensus)
library(tidyr)
library(tidyverse)
library(tigris)
library(tmap)
library(units)

options(scipen = 999)
# Property data.
properties_path <- "data/philly_properties.csv"
properties <- read.csv(properties_path)

# Capture dimensions.
og_property_dimension <- dim(properties)

# Set Census API key.
census_api_key("3aaee31789e10b674a531e9f236c35d5394b19ed")
# All variables are character strings, remove white space then convert numeric character variables to numeric classes for chosen variables.

properties <- properties %>%
  mutate(across(where(is.character), str_trim),
         across(c(fireplaces, garage_spaces, number_of_bathrooms, number_stories,
                  sale_price, total_livable_area, year_built), as.numeric)) %>%
  rename(fireplace_num = fireplaces,
         garage_num = garage_spaces,
         bath_num = number_of_bathrooms,
         story_num = number_stories,
         square_feet = total_livable_area,
         basement_type = basements,
         ac_binary = central_air,
         fuel_type = fuel,
         )
# Filter to residential properties and 2023-2024 sales.
# Note: Category code #1 is for residential.
residential_prop <- properties %>%
  filter(.,
         category_code == 1,
         startsWith(sale_date, "2023") | startsWith(sale_date, "2024"))

# Drop empty variables or variables not needed for model.
residential_prop <- residential_prop %>%
  select(c(basement_type, ac_binary, fireplace_num, fuel_type, garage_num,
           bath_num, story_num, sale_date, sale_price,
           square_feet, year_built, shape)
         )

# Make empty character column values NA.
residential_prop <- residential_prop %>%
  mutate(across(where(is.character), ~na_if(., "")))
# Drop prices that are less than $10,000 as a catch-all (might not be as reflective for rural areas). Avoiding dropping prices based on percent of assessed value since property assessments can be biased against minoritized communities. Ideal drop would add deed type to drop any family or forced transfers.
residential_prop <- residential_prop %>%
  filter(sale_price > 10000,
         square_feet > 0)
# Create building age column, change central air to binary, and adjust basement and fuel types.
# Create log value for the sale price.
residential_prop <- residential_prop %>%
  mutate(ln_sale_price = log(sale_price),
         age = 2025 - year_built,
         ln_square_feet = log(square_feet),
         ac_binary = case_when(
           ac_binary == "Y" ~ 1,
           ac_binary == "N" ~ 0),
         basement_type = case_when(
           basement_type == "0" ~ "None",
           basement_type == "A" ~ "Full Finished",
           basement_type == "B" ~ "Full Semi-Finished",
           basement_type == "C" ~ "Full Unfinished",
           basement_type == "D" ~ "Full Unknown Finish",
           basement_type == "E" ~ "Partial Finished",
           basement_type == "F" ~ "Partial Semi-Finished",
           basement_type == "G" ~ "Partial Unfinished",
           basement_type == "H" ~ "Partial Unknown Finish",
           basement_type == "I" ~ "Unknown Size Finished",
           basement_type == "J" ~ "Unknown Size Unfinished"),
         fuel_type = case_when(
           fuel_type == "A" ~ "Natural Gas",
           fuel_type == "B" ~ "Oil Heat",
           fuel_type == "C" ~ "Electric",
           fuel_type == "D" ~ "Coal",
           fuel_type == "E" ~ "Solar",
           fuel_type == "F" ~ "Wood",
           fuel_type == "G" ~ "Other",
           fuel_type == "H" ~ "None")
         )
# Turn categorical data into factors so OLS regression doesn't handle the data as a list of strings.
residential_prop$basement_type <- as.factor(residential_prop$basement_type)
residential_prop$fuel_type <- as.factor(residential_prop$fuel_type)

# Change the reference categories for baseline comparison.
residential_prop$basement_type <- relevel(residential_prop$basement_type, ref = "None")
residential_prop$fuel_type <- relevel(residential_prop$fuel_type, ref = "Natural Gas")

# Place fuel type with 10 or less counts into other category.
residential_prop <- residential_prop %>%
  mutate(fuel_type = fct_lump_min(fuel_type, min = 11, other_level = "Other"))
# Fixed effect temporal market fluctuations. Based on sale date, splitting the years into quarters (Q1, Q2, Q3, Q4). Potential fixed effect.
residential_prop <- residential_prop %>%
  mutate(
    quarters_fe = quarter(as_datetime(sale_date))
    )

# Make it a factor.
residential_prop$quarters_fe <- factor(residential_prop$quarters_fe)
# Capture dimensions.
after_property_dimension <- dim(residential_prop)

# Convert residential property to geodataframe. Use EPSG 2272 for South Pennsylvania in feet.
# Drop shape when finished creating geometry.
residential_prop_gdf <- residential_prop %>%
  mutate(geometry = st_as_sfc(shape)) %>%
  st_as_sf(crs = 2272) %>%
  rename(geometry_point = geometry) %>%
  select(-c(shape))

Spatial Data and Feature Engineering

Spatial Data (OpenDataPhilly)

# Read in Philadelpha census tracts.
philly_tracts_path <- "data/philly_tracts/philly_tracts.shp"
philly_tracts <- st_read(philly_tracts_path)
Reading layer `philly_tracts' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\philly_tracts\philly_tracts.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3446 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -80.51985 ymin: 39.7198 xmax: -74.68956 ymax: 42.51607
Geodetic CRS:  NAD83
# Match CRS.
philly_tracts <- st_transform(philly_tracts, crs = 2272)

# Left spatial join.
residential_points <- st_join(residential_prop_gdf, philly_tracts)

# Drop unnecessary columns and remove incomplete observations (rows) for upcoming spatial feature computations.
residential_points <- residential_points %>%
  select(-c(FUNCSTAT, MTFCC, NAMELSAD, NAME,
            STATEFP, COUNTYFP, TRACTCE)) %>%
  na.omit(.)

# Remove unneeded datasets for housekeeping and call garbage collector to reduce memory.
rm(properties, residential_prop, residential_prop_gdf)
gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  8653972 462.2   14727550  786.6  12442126  664.5
Vcells 73496097 560.8  139947442 1067.8 139947222 1067.8
# Proximity to downtown.

# Decided on Euclidean distance because network proximity computation is demanding on thousands of points, even with parallel programming.

# Create single Center City point feature based on City Hall coordinates.
center_city <- st_sfc(st_point(c(-75.163500, 39.952800)), crs = 4326) %>%
  st_transform(crs = 2272)

# Need to add mile units for operations. Then remove units object for easier plotting.
residential_points$city_dist_mi <- (st_distance(residential_points, st_union(center_city))) %>%
  set_units("mi") %>%
  drop_units() %>%
  as.numeric()

# Log transform because distance benefit diminishes, for potential use.
residential_points$ln_city_dist <- log(residential_points$city_dist_mi + 0.1)
# Transit proximity.
# Major cities could be distance to nearest transit like metro/rail stations, but suburban and rural areas might be better served by distance to nearest major highway.
# Read in SEPTA stops.
septa_stops_path <- "data/septa_stops.csv"

septa_stops_df <- read.csv(septa_stops_path)

# Make csv a geodataframe.
septa_stops <- septa_stops_df %>%
  st_as_sf(., coords = c("Lon", "Lat"), crs = 4326)

# Match CRS.
septa_stops <- septa_stops %>%
  st_transform(., crs = 2272)

# Stops are duplicated for the same station because the data includes directions for all cardinal directions as well as bus, rail, and trolley for the same location. This means a single station could have more than one point representing a single location residents go to commute.
# Create new column with stop name without the cardinal suffixes and keep only the unique station values.
septa_stops <- septa_stops %>%
  mutate(stations = if_else(
    str_detect(StopAbbr, "NO$|SO$|EA$|WE$|NE$|NW$|SE$|SW$"),
    str_sub(StopAbbr, end = -3),
    str_sub(StopAbbr)
    )
  ) %>%
  distinct(stations, .keep_all = TRUE)

# Create buffer zone for stops within a half mile. This is ~10 minute walk, depending on topography.
# Note: EPSG 2272 is measured in feet, not miles.
septa_distance <- st_buffer(residential_points, 2640)

# Create number of stops in the buffer zone.
septa_stations <- st_intersects(septa_distance, septa_stops)

# Append buffer zone counts and put into main tract data. Create a logged version for potential use as well because distance benefit tapers off.
residential_points <- residential_points %>%
  mutate(
    septa_half_mi = lengths(septa_stations),
    ln_septa_half_mi = log(septa_half_mi + 0.1)
  )
# Park proximity / size. Measuring distance is important for accessibility, but the size of the park often matters because a property near a block-sized pocket of green space is not equivalent to being near a large one like Wissahickon Valley Park.

# Read in geojson data.
parks_path <- "data/parks.geojson"

parks <- st_read(parks_path)
Reading layer `parks' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\parks.geojson' 
  using driver `GeoJSON'
Simple feature collection with 63 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -75.2837 ymin: 39.87048 xmax: -74.95865 ymax: 40.13191
Geodetic CRS:  WGS 84
# Match CRS and filter by parks.
parks <- parks %>%
  st_transform(., crs = 2272) %>%
  filter(str_detect(USE_, "PARK"))

# Get distance to the edge of the nearest park.
# Note: Don't try to do spatial operations in apply() and mutate().
# Distance matrix of residential properties to parks.
parks_matrix <- st_distance(residential_points, parks)

# Get the nearest distance for each point.
residential_points$parks_mile <- apply(parks_matrix, 1, min)

# Convert to miles.
residential_points$parks_mile <- as.numeric(residential_points$parks_mile) / 5280

# Log parks data for potential use because of diminishing distance benefits.
residential_points$ln_park_dist <- as.numeric(log(residential_points$parks_mile + 0.1))
# Convenience/Food points of interest. Using kNN to measure the density of these amenities rather than nearest amenity point.
amenities_path <- "data/osm_pois/osm_pois.shp"
amenities <- st_read(amenities_path)
Reading layer `osm_pois' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\osm_pois\osm_pois.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 65127 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.52111 ymin: 39.71816 xmax: -74.69473 ymax: 42.25797
Geodetic CRS:  WGS 84
# Filter amenities by convenience and food.
amenities <- amenities %>%
  filter(fclass %in% c(
    "atm", "bakery", "bank", "bar", "beauty_shop", "biergarten", "bookshop",
    "butcher", "cafe", "convenience", "department_store", "fast_food", "food_court",
    "greengrocer", "hairdresser", "kiosk", "laundry", "market_place", "pharmacy",
    "mall", "pub", "restaurant", "supermarket"
  )
         ) %>%
  st_transform(., crs = 2272)

# Distance matrix of residential properties to amenities.
amenities_matrix <- st_distance(residential_points, amenities)
# k-Nearest Neighbors (kNN) function.
knn_distance <- function(distance_matrix, k) {
  apply(distance_matrix, 1, function(distances){
    mean(as.numeric(sort(distances)[1:k]))
  })
}

# Create kNN feature for amenities. k = 4 to balance for urban and suburban areas, probably not as representative of rural areas.
residential_points <- residential_points %>%
  mutate(
    knn_amenity_mi = as.numeric(knn_distance(amenities_matrix, k = 4))
  )

# Convert to miles.
residential_points$knn_amenity_mi <- as.numeric(residential_points$knn_amenity_mi / 5280)
# Fixed effect neighborhoods.
neighborhoods_path <- "data/philadelphia_neighborhoods/philadelphia_neighborhoods.shp"
philly_neighborhoods <- st_read(neighborhoods_path)
Reading layer `philadelphia_neighborhoods' from data source 
  `F:\Upenn\MUSA 5080 Public Policy Analytics\portfolio-setup-MarkD12138\assignments\assignment_3\data\philadelphia_neighborhoods\philadelphia_neighborhoods.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 159 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -75.28026 ymin: 39.86701 xmax: -74.95576 ymax: 40.13799
Geodetic CRS:  WGS 84
# Match CRS.
philly_neighborhoods <- philly_neighborhoods %>%
  st_transform(., crs = 2272)

# Join to residential points and rename to neighborhoods.
residential_points <- residential_points %>%
  st_join(., philly_neighborhoods) %>%
  rename(neighborhood_fe = MAPNAME)

# Make the neighborhoods a factor.
residential_points$neighborhood_fe <- relevel(factor(residential_points$neighborhood_fe), ref = "East Falls")

# Place neighborhoods with 10 or less sales into a small neighborhoods category.
residential_points <- residential_points %>%
  mutate(neighborhood_fe = fct_lump_min(neighborhood_fe, min = 11, other_level = "Small Neighborhoods"))
# Capture spatial feature dimensions.
after_feat_eng_dimension <- dim(residential_points)
# Spatial feature creation table, select spatial features into a separate data frame and drop geometry.
spatial_feature_df <- residential_points %>%
  select(c(city_dist_mi, ln_city_dist, septa_half_mi, ln_septa_half_mi,
           parks_mile, ln_park_dist, knn_amenity_mi)) %>%
  na.omit(.) %>%
  st_drop_geometry()

# Create a tibble from the selected spatial features.
spatial_summary <- tibble(
  "Spatial Features" = names(spatial_feature_df),
  "Description" = c("Distance from city (mi).", "Log of distance from city.", "Within 0.5mi of SEPTA station.",
                    "Log of 0.5 SEPTA station.", "Distance from nearest park (mi).",
                    "Log of distance from nearest park.", "k-Nearest Neighbors convenience and food amenities.")
)

# Make Kable of spatial features.
spatial_kable <- kable(spatial_summary,
                       caption = "Feature Engineered Variables",
                       format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "5cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

spatial_kable
Feature Engineered Variables
Spatial Features Description
city_dist_mi Distance from city (mi).
ln_city_dist Log of distance from city.
septa_half_mi Within 0.5mi of SEPTA station.
ln_septa_half_mi Log of 0.5 SEPTA station.
parks_mile Distance from nearest park (mi).
ln_park_dist Log of distance from nearest park.
knn_amenity_mi k-Nearest Neighbors convenience and food amenities.

Primary:

From the CSV, we are analyzing the conditions of basements, number of fireplaces, garage spaces, number of bathrooms, number of stories, the total livable area in square feet, the existence of central air, and type of fuel used on the property. We filtered residential category code with the sale dates between 2023 and 2024. We eliminated property prices <10k. Rather than adhering to the percentage of assessed value as a guide for this filter, for it could incorporate marginalized bias, filtering the property prices removes non-market transactions but still incorporates a wide diversity of communities.

The forced the central air characteristic to become binary rather than “Y” and “N” and made sure to turn the categorical data to factors. The reference categories for types of basements is “None” and for fuel type, it’s “Natural Gas”. We including a building age category computed from the year built data. We logged the square footage and the sale price to correct for right-skewedness.

Spatial:

We inserted the Philadelphia census tracts, changing the CRS to 2272, the ideal projection for SE Pennsylvania analysis. We decided to perform a log transformation on the following variables - city_dist (distance from City Hall in Center City), septa_half_mi (half mile buffer zone from all septa stops within the city geometry), and parks_dist (distance to edge of nearest park in miles) - because their effects on housing prices were non-linear. This transformation ensures that changes in proximity are measured more consistently across the range of distances, rather than being dominated by properties very close to these features.

Regarding amenities, we used k-NN (k nearest neighbors) to measure the density of amenity accessibility rather than individual point data. The amenities are as follows: ATM, bakery, bank, bar, beauty shop, biergarten, bookshop, butcher, café, convenience, department store, fast food, food court, greengrocer, hairdresser, kiosk, laundry, marketplace, pharmacy, mall, pub, restaurant, supermarket. We filtered by convenience and food, transformed the CRS to 2272 for consistency, and then developed a matrix. The distance was inverted, log-transformed to account for diminishing returns, and scaled it to produce a single numeric value in which higher, positive values indicate greater accessibility to amenities.

We included neighborhoods as fixed effects to help explain unknown, unquantifiable factors like cultural atmosphere and other neighborhood-specific factors we cannot statistically account for that may influence housing prices. It was converted into a factor so each neighborhood can receive its own baseline model. Fiscal quarters were also introduced as fixed effects; splitting a year into 4 quarters for unknown factors when it comes to purchasing property (e.g. people are more likely to buy real estate in spring and summer).

Census Data (TidyCensus)

# Open tidycensus data. Using 2023 data, because we are looking at sales 2023-2024
acs_vars <- load_variables(2023, "acs5", cache = TRUE)

# Get acs dimensions.
og_acs_dimension <- dim(acs_vars)

# The variables that we want from tidycensus
variables <- c(
  median_household_income = "B19013_001",
  total_pop = "B01003_001",
  poverty_white = "B17001A_001", # To get poverty percentage
  poverty_black = "B17001B_001",
  poverty_native = "B17001C_001",
  poverty_asian = "B17001D_001",
  poverty_islander = "B17001E_001",
  poverty_other = "B17001F_001",
  poverty_multiracial = "B17001G_001",
  male_18_24_bach = "B15001_009", # Tracts only show bachelor's degrees, unless we want to look at only people 25+
  male_25_34_bach = "B15001_017",
  male_35_44_bach = "B15001_025",
  male_45_64_bach = "B15001_033",
  male_65plus_bach = "B15001_041",
  female_18_24_bach = "B15001_050",
  female_25_34_bach = "B15001_058",
  female_35_44_bach = "B15001_066",
  female_45_64_bach = "B15001_074",
  female_65plus_bach = "B15001_082",
  total_vacant = "B25005_001", # To get vacancy percentage
  white_total_units = "B25032A_001", # Need total units to get percentage of single, detached units and vacant units.
  white_single_family = "B25032A_002",
  black_total_units = "B25032B_001",
  black_single_family = "B25032B_002",
  native_total_units = "B25032C_001",
  native_single_family = "B25032C_002",
  asian_total_units = "B25032D_001",
  asian_single_family = "B25032D_002",
  islander_total_units = "B25032E_001",
  islander_single_family = "B25032E_002",
  other_total_units = "B25032F_001",
  other_single_family = "B25032F_002",
  multiracial_total_units = "B25032G_001",
  multiracial_single_family = "B25032G_002",
  medhhinc_white = "B19013A_001", # Median Household Income
  medhhinc_black = "B19013B_001",
  medhhinc_native = "B19013C_001", 
  medhhinc_asian = "B19013D_001", 
  medhhinc_other = "B19013F_001",  # There is no tract data for native hawiian/pacific islander, I'm including it with other
  medhhinc_multiracial = "B19013G_001", 
  white_pop = "B01001A_001",
  black_pop = "B01001B_001",
  native_pop = "B01001C_001",
  asian_pop = "B01001D_001",
  islander_pop = "B01001E_001",
  other_pop = "B01001F_001",
  multiracial_pop = "B01001G_001"
)

# We are grouping our data by tracts
philly_tract_acs <- get_acs(
  geography = "tract",
  state = "PA",
  variables = variables,
  year = 2022,
  survey = "acs5",
  cache_table = TRUE, 
  output = "wide"
)
# Summing up the variables that we need to create our percentage variables
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    total_poverty = poverty_whiteE + poverty_blackE + poverty_nativeE + poverty_asianE + poverty_islanderE + poverty_otherE + poverty_multiracialE, # Adding all poverty populations together 
    
    total_bach = male_18_24_bachE + male_25_34_bachE + male_35_44_bachE + male_45_64_bachE + male_65plus_bachE + female_18_24_bachE + female_25_34_bachE + female_35_44_bachE + female_45_64_bachE + female_65plus_bachE, #Adding all bachelors degrees together
    
    total_units = white_total_unitsE + black_total_unitsE + native_total_unitsE + asian_total_unitsE + islander_total_unitsE + other_total_unitsE + multiracial_total_unitsE, # Total housing units
    
    total_single_family = white_single_familyE + black_single_familyE + native_single_familyE + asian_single_familyE + islander_single_familyE + other_single_familyE + multiracial_single_familyE # Total single family homes
  )
# Creating our variables that we are going to analyze
philly_tract_acs <- philly_tract_acs %>%
  mutate(
    pct_poverty = (total_poverty/total_popE)*100, # Divide total poverty population by total population

    pct_bach = (total_bach/total_popE)*100, # Divide bachelor degree holders by total population
    
    pct_vacant = (total_vacantE/total_units)*100, # Divide vacant units by total housing units
    pct_vacant = ifelse(is.infinite(pct_vacant) | total_vacantE > total_units, 100, pct_vacant), # Fixing errors when units equal zero or high MOE
    
    pct_single_family = (total_single_family/total_units)*100, # Divide single family homes by total housing units
    
    medhhinc = 
  (ifelse(is.na(medhhinc_whiteE), 0, medhhinc_whiteE) * white_popE +
   ifelse(is.na(medhhinc_blackE), 0, medhhinc_blackE) * black_popE +
   ifelse(is.na(medhhinc_nativeE), 0, medhhinc_nativeE) * native_popE +
   ifelse(is.na(medhhinc_asianE), 0, medhhinc_asianE) * asian_popE +
   ifelse(is.na(medhhinc_otherE), 0, medhhinc_otherE) * (islander_popE + other_popE) +
   ifelse(is.na(medhhinc_multiracialE), 0, medhhinc_multiracialE) * multiracial_popE) / total_popE)
# For median household income, I had to turn all median household incomes that were NA to 0, so that it would not mess up the formula. 
# Multiplying median household income times population by race. There was no islander median household income, so I included it in other. All divided by the total population, to get the total median household income. 
# Creating a summary table 
philly_acs_summary <- philly_tract_acs %>%
  select(
    GEOID, 
    NAME,
    pct_poverty,
    pct_bach,
    pct_vacant,
    pct_single_family,
    medhhinc
  )

# Get after acs dimension.
after_acs_dimension <- dim(philly_acs_summary)
# Join primary and census data.
final_data <- residential_points %>%
  left_join(philly_acs_summary, by = "GEOID") %>%
  select(-c(sale_date, year_built, ALAND, AWATER,
            INTPTLAT, INTPTLON, NAME.x, LISTNAME, NAME.y,
            Shape_Leng, Shape_Area)
         )
# Create key variables list.
key_columns <- c("sale_price", "ln_sale_price", "square_feet", "ln_square_feet",
                 "bath_num", "fireplace_num", "garage_num", "ac_binary",
                 "story_num", "age", "city_dist_mi", "ln_city_dist",
                 "septa_half_mi", "ln_septa_half_mi", "parks_mile", "ln_park_dist",
                 "knn_amenity_mi", "pct_poverty", "pct_bach",
                 "pct_vacant", "pct_single_family", "medhhinc",
                 "basement_type", "fuel_type", "neighborhood_fe", "quarters_fe")

# Reorder for key columns first and drop all rows with NA because OLS needs complete observations.
final_data <- final_data %>%
  select(any_of(key_columns), everything()) %>%
  na.omit(.)

# Get final dimension.
final_dimension <- dim(final_data)
# Separate before/after dimensions for data.
dimensions <- data.frame(
  rows_columns = c("Rows", "Columns"),
  "Property Data Before" = og_property_dimension,
  "Property Data After" = after_property_dimension,
  "Property Data After Feature Engineering" = after_feat_eng_dimension,
  "ACS Data Before" = og_acs_dimension,
  "ACS Data After" = after_acs_dimension,
  "Final Data" = final_dimension
)

# Make Kable of dimensions.
dimensions_kable <- kable(dimensions,
                          col.names = c("Dimensions", "Property Data Before", "Property Data After",
                                        "Property Data After Feature Engineering",
                                        "ACS Data Before", "ACS Data After", "Final Data"),
                          digits = 2,
                          caption = "Before and After Data Dimensions",
                          format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100", bold = TRUE)

dimensions_kable
Before and After Data Dimensions
Dimensions Property Data Before Property Data After Property Data After Feature Engineering ACS Data Before ACS Data After Final Data
Rows 560,961 26,347 13,942 28,261 3,446 13,941
Columns 79 16 33 4 7 28

Census:

Using tidycensus, we imported all variables that aligned with our structural data from 2023 ACS data by tracts: median household income, total population, poverty by ethnicity (White, Black, Native American, Asian, Pacific Islander, “Other,” Multiracial), males and females aged 18–65+ with bachelor’s degrees or higher, total vacancy, and total housing units by ethnicity, as well as single-family households and median household income per ethnic group. We compiled the individual poverty, bachelor’s degree, unit, and single-family household counts by ethnicity to form the following percentage variables: total_poverty, total_bach, total_units, and total_single_family.

From this, we calculated pct_poverty, pct_bach, pct_vacant (accounting for ACS errors), pct_single_family, and medhhinc, transforming NAs to 0 for regression analysis.

Using our residential property vector data (which includes structural, spatial, and feature-engineered variables), we performed a left join on the cleaned ACS summary data by GEOID.

After organizing the final dataset so key variables appear first, we generated a kable summarizing the workflow. The final dataset contains 26,344 observations and 29 columns.

Exploratory Data Analysis (EDA)

Visualizations

# Distribution of Sale Prices Histogram

sale_price_hist <- ggplot(final_data, aes(x = sale_price)) +
  geom_histogram(fill = "#6BAED6", color = "white", bins = 25) +
  labs(title = "Distribution of Sale Prices",
       x = "Sale Price ($)",
       y = "Count") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

sale_price_hist

Distribution of Sales Prices - Histogram

The distribution of sales prices is heavily skewed right, with a high percentage of thee transactions below $500,000, and a relatively small number of homes falling in the multi-million dollar range. This chart also also gives insight into the potential presence of high leverage outliers falling toward the left. A concentration of thousands of sales clustered on the lower end of the price ranges suggests strong segmentation, meaning that the market is split into distinct groups, where you see a large cluster of homes that are relatively similar and a smaller but very different cluster of high-value properties with luxury features and few homes in-between, the market looks like separate groups rather than one gradually increasing scale. It’s important to note that this spread likely indicates heteroskedasticity, and would require a log transformation for statistical inference.

# Geographic distribution map - Sale Price

tmap_mode("plot")

sale_price_map <- tm_shape(philly_neighborhoods) +
  tm_polygons(col = "gray95", border.col = "gray65") +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          size = 0.05,
          style = "quantile",
          title = "Sale Price ($)") +
  tm_layout(main.title = "Geographic Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE,
            bg.color = "#f5f4f0")

sale_price_map

#tmap_save(tm = sale_price_map, filename = "slide_images/geo-price-map.png", width = 10, height = 6, units = "in", dpi = 300, bg = "transparent")

Geographic Distribution of Sales Prices - Map

This map displays a high variation of sales prices distributed throughout the city. The higher priced clustered areas are the Center City, University City, the riverfront, and affluent pockets of the Northwest, potentially because these areas provide easy access to transit, employment centers, and cultural amenities. The northern area stretching above Broad Street into parts of West and North Philadelphia displays lower-priced homes, potentially reflecting long-term disinvestment, high vacancy rates, and aging non-renovated housing stock. The spatial clustering suggests that sale price is place-dependent in Philadelphia, mostly due to neighborhood qualities, and fixed effects.

# Sale Price vs structural features scatter plot

# Sale Price vs. Square feet 
price_v_sqft_plot <- ggplot(final_data, aes(x = square_feet, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#08519C", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Square Feet",
       x = "Square Feet",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_sqft_plot

#ggsave("slide_images/price-v-sqft.png", plot = price_v_sqft_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Square Feet

This scatter plot highlights the importance of home size as a structural indicator of value, even though the relationship may not be linear. There is a dense concentration of homes clustered below 3,000 sq ft. and under $500,000 consistent with the sales price distribution. Here we are seeing the same strong skew to the right, displaying a relationship between the two variables. The upward trend is fairly obvious, larger homes = an increase in price, however the spread gets wider as square footage increases, indicating that while square footage is positively associated with price, the weaker relationship among larger properties suggests that additional living space contributes less to value once the home reaches a certain size category.

# Sale Price vs. Number of Fireplaces 

price_v_fire_plot <- ggplot(final_data, aes(x = fireplace_num, y = sale_price)) +
  geom_jitter(alpha = 0.5, color = "#08519C", size = 1.3, width = 0.15, height = 0) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Number of Fireplaces",
       x = "Number of Fireplaces",
       y = "Sale Price ($)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.background = element_rect(fill = "#f5f4f0"),
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_fire_plot

#ggsave("slide_images/price-v-fire.png", plot = price_v_fire_plot, width = 6, height = 4, units = "in", dpi = 300)

Sale Price vs. Number of Fireplaces

This chart displays the positive relationship between the value of a home in relation to the quality and character of aesthetic features. Nearly all homes with no fireplaces remain in the lower-mid price range, and once a home has two or more fire places the sale price increases to a much higher range. Homes in Philadelphia with several fireplaces are usually bigger, older, homes with higher-end finishes, meaning that fireplace count can serve as a secondary and indirect indicator of several other indicators that influence home value, and this is the reason for the high level of noise on the chart.

# Sale Price vs. Spatial Features

# Sale Price vs Distance to city center
price_v_spatial_plot <- ggplot(final_data, aes(x = city_dist_mi, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#41AB5D", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Downtown",
       x = "ln Distance to City Center, mi",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_spatial_plot

Sale Price vs. Distance to City Center

The plots shows a weak negative relationship between price and distance to the city center. High valued homes fall both close to downtown and well outside of it, highlighting the structure of Philadelphia’s housing market as it relates to location. It’s not a city where closer is always better, instead certain neighborhoods like chestnut hill maintain their high premiums due to neighborhood reputation, and school quality. This pattern suggests that there are many fixed affects at play here, and it is important to note that similar patterns were displayed among many spatial variables.

# Sale Price vs. Distance to Parks - Park Accessibility

price_v_park_plot <- ggplot(final_data, aes(x = parks_mile, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#238B45", size = 1.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Sale Price vs. Distance to Nearest Park",
       x = "Distance to Nearest Park (mi)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
    )

price_v_park_plot

Sale Price vs. Distance to Parks - Park Accessibility

The plot suggests that distance to parks has very little trend with sale price. Many high-priced properties sit both very close and very far away from parks, suggesting that park accessibility alone may not hold much value. This could be correlated with the fact that some parks are major attractions; i.e Fairmount Park, while others have limited impact on neighborhood desirability, especially in areas of low investment. Because of the difference in park quality across the city, the distance metric may not represents the true relationship, and more neighborhood or amenity variables may be required to capture environmental quality more accurately.

# Median income vs Sale Price per neighborhood

inc_v_price_plot <- ggplot(final_data, aes(x = medhhinc, y = sale_price, color = neighborhood_fe)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Relationship Between Median Income and Sale Price by Neighborhood",
       x = "Median Household Income ($)",
       y = "Sale Price") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_viridis_d(option = "turbo") +
  theme_minimal(base_size = 13) +
  theme(
    panel.grid.major = element_line(color = "gray90", size = 0.3),
    panel.grid.minor = element_line(color = "gray95", size = 0.2),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, margin = margin(b = 10)),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    legend.position = "bottom",
    legend.text = element_text(size = 7),
    legend.key.size = unit(0.4, "cm"),
    legend.box = "horizontal"
    ) +
  guides(color = guide_legend(ncol = 8, byrow = TRUE))

inc_v_price_plot

Median Income vs Sale Price per neighborhood

This plot further advances the claim that even with a lot of scatter in the points, there is a noticeable upward trend of homes in higher-income neighborhoods selling for higher prices. Many of the neighborhoods cluster in specific ranges of the price scale, even when income levels are similar, suggesting that the housing market in Philadelphia is dependent not only on demographics or income, but place effects, where reputation or fixed effects in a neighborhood increase or reduce prices. An example of this is some neighborhoods with moderate household incomes still show clusters of high-value transactions, while others with similar incomes remain in the lower end of the market, again highlighting other factors and fixed effects like transit access, historical character, and school quality. The main takeaway is that while higher-income neighborhood tend to have homes with higher sale prices, neighborhood identity also plays a big role in sale price.

# Spatial Relationship Between sale price and structural predictors

tmap_mode("plot")

# Sale Price
map_value <- tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "sale_price",
          palette = "YlOrRd",
          style = "quantile",
          size = 0.04,
          title = "Sale Price") +
  tm_layout(main.title = "Distribution of Sale Prices",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Bathrooms Predictor

map_bath <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "bath_num",
          palette = "Blues",
          style = "quantile",
          size = 0.04,
          title = "Number of Bathrooms") +
  tm_layout(main.title = "Distribution of Bathrooms",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Stories Predictor

map_story <-  tm_shape(philly_tracts[philly_tracts$COUNTYFP == 101, ]) +
  tm_polygons(col = "gray90", border.col = "gray45", lwd = 0.5) +
  tm_shape(final_data) +
  tm_dots(col = "story_num",
          palette = "Greens",
          style = "quantile",
          size = 0.04,
          title = "Number of Stories") +
  tm_layout(main.title = "Distribution of Stories",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = FALSE)
# Display them side by side

tmap_arrange(map_value, map_bath, map_story, ncol = 3)

map_struct <- tmap_arrange(map_value, map_bath, map_story, ncol = 3)

Spatial Relationship Between sale price and structural predictors (bathrooms and stories) - Map

This figure shows how housing characteristics are clustered across Philadelphia. Here we can see the sale price distribution directly against certain structural features. Bathrooms and number of stories follow similar geographic patterns, as sale price, areas with higher sales prices tend to have bigger homes with more bathrooms and more stories. In contrast, neighborhoods where homes have less structural amenities also have subsequently lower sale prices. This group of maps makes a strong visual argument that structural features and neighborhood context evolve together, supporting the modeling strategy of incorporating more structural than spatial predictors.

# Sale price histogram.
price_hist <- ggplot(residential_points, aes(x = sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of Sale Price", x = "Sale Price ($)", y = "Count") +
  theme_minimal()

# Log sale price histogram.
ln_price_hist <- ggplot(residential_points, aes(x = ln_sale_price)) +
  geom_histogram(fill = "pink", color = "white") +
  labs(title = "Histogram of ln(Sale Price)", x = "ln(Sale Price)", y = "Count") +
  theme_minimal()

grid.arrange(price_hist, ln_price_hist, ncol = 2)

The raw distribution of sale prices is right-skewed with a median of 250,000 USD and a mean of 343,867 USD. There is a substantial gap in price between the third quartile (375,000 USD) and the maximum price (15,428,633 USD). While 75% of houses sold for 375,000 USD or less, the upper 25% exhibits considerable variability, with prices ranging up to 15.4 million USD, affecting the tail distribution. This suggests that a small number of luxury properties are affecting the distribution of the sales price data To address this skewness and improve model performance, we performed a log transformation, making our data closer to normal by compressing the scale of higher values, emphasizing a standardized change in percentage over dollar amount.

# Livable space histogram.
livable_area_hist <- ggplot(residential_points, aes(x = square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of Square Feet", x = "Square Feet", y = "Count") +
  theme_minimal()

# Log livable space histogram.
ln_square_feet_hist <- ggplot(residential_points, aes(x = ln_square_feet)) +
  geom_histogram(fill = "wheat", color = "white") +
  labs(title = "Histogram of ln(Square Feet)", x = "ln(Square Feet)", y = "Count") +
  theme_minimal()

grid.arrange(livable_area_hist, ln_square_feet_hist, ncol = 2)

The distribution of livable area is right-skewed, with a median of 1,216 sq ft and a mean of 1,372 sq ft. While 75% of homes are under 1,509 sq ft, a small number of larger properties extend the upper tail of the distribution. Notably, homes above 3,000 sq ft become increasingly sparse, suggesting a separation between standard and luxury housing markets. We applied a log transformation to this variable to create a more symmetric distribution by compressing the scale of larger homes, which improves linearity in our model and allows coefficients to represent proportional rather than absolute changes in square footage.

# Distance to downtown histogram.
downtown_dist_hist <- ggplot(residential_points, aes(x = city_dist_mi)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of Downtown Distance", x = "Downtown Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to downtown histogram.
ln_downtown_dist_hist <- ggplot(residential_points, aes(x = ln_city_dist)) +
  geom_histogram(fill = "darkblue", color = "white") +
  labs(title = "Histogram of ln(Downtown Distance)", x = "ln(Downtown Distance)", y = "Count") +
  theme_minimal()

grid.arrange(downtown_dist_hist, ln_downtown_dist_hist, ncol = 2)

The distribution of distance to Center City (City Hall) is right-skewed, with fewer observations of houses occurring as the distance from Center City/City Hall increases. The effect of distance on housing prices is non-linear: being 1 mile from Center City has a larger impact on price than being 6 vs. 11 miles out. To account for this, we applied a log transformation, which compresses the upper tail, creates a more symmetric distribution, and reduces the influence of extreme distances. This transformation improves linearity in our regression model and allows coefficients to be interpreted as proportional changes in price per proportional change in distance.

# SEPTA buffer histogram.
septa_buffer_hist <- ggplot(residential_points, aes(x = septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of SEPTA 0.5mi Buffer", x = "SEPTA 0.5mi Buffer (mi)", y = "Count") +
  theme_minimal()

# Log SEPTA buffer histogram.
ln_septa_buffer_hist <- ggplot(residential_points, aes(x = ln_septa_half_mi)) +
  geom_histogram(fill = "azure3", color = "white") +
  labs(title = "Histogram of ln(SEPTA 0.5mi Buffer)", x = "ln(SEPTA 0.5mi Buffer)", y = "Count") +
  theme_minimal()

grid.arrange(septa_buffer_hist, ln_septa_buffer_hist, ncol = 2)

The distribution of SEPTA access within a 0.5-mile buffer of each property is right-skewed, with a median of 44 and a mean of 52.5. While most properties have between 29 and 69 nearby SEPTA access points, there is substantial variation ranging from zero (likely remote suburban properties) to over 160 in the most transit-dense neighborhoods. The log transformation compresses this wide range and produces a more symmetric distribution, which is appropriate given that transit accessibility exhibits diminishing returns—the marginal benefit of additional access decreases as the total number increases.

# kNN amenities histogram.
knn_amenities_hist <- ggplot(residential_points, aes(x = knn_amenity_mi)) +
  geom_histogram(fill = "darkseagreen", color = "white") +
  labs(title = "Histogram of kNN Amenities", x = "kNN Amenities (mi)", y = "Count") +
  theme_minimal()

grid.arrange(knn_amenities_hist)

For the kNN Amenities variable, the mean distance to the nearest amenity per household is 0.31 miles, and the median is 0.27 miles. Seventy-five percent of households are within 0.42 miles of any of the 23 amenities described in the data cleaning section. These statistics showcase Philadelphia’s reputation as a highly walkable city. Observations beyond 1 mile typically reflect suburban or rural settings. We did not transform this variable, instead using the raw distances to preserve the direct relationship between amenity proximity and property values.The kNN approach inherently reflects local amenity density, with shorter distances in denser areas and longer distances where households are more dispersed.

# Distance to park histogram..
parks_dist <- ggplot(residential_points, aes(x = parks_mile)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of Parks Distance", x = "Parks Distance (mi)", y = "Count") +
  theme_minimal()

# Log distance to park histogram.
ln_parks_dist <- ggplot(residential_points, aes(x = ln_park_dist)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of ln(Parks Distance)", x = "Parks Distance", y = "Count") +
  theme_minimal()

grid.arrange(parks_dist, ln_parks_dist, ncol = 2)

The distribution of the variable Distance from Parks by miles is showing a slight right skew. Between the third quartile and the maximum there is a jump of about 3 miles, indicating outliers beyond 2 miles. Due to this, we chose to log the variable to ensure it is normal for our model.

# Number of bathrooms histogram.
ggplot(residential_points, aes(x = bath_num)) +
  geom_histogram(fill = "gold", color = "white") +
  labs(title = "Histogram of Bathrooms", x = "Bathrooms", y = "Count") +
  theme_minimal()

The histogram showcases the number of observations of properties with n bathrooms. Most observations exhibit 1 or 2 bathrooms per property, with a an outlier of 8 bathrooms.

# Number of fireplaces histogram.
ggplot(residential_points, aes(x = fireplace_num)) +
  geom_histogram(fill = "darkred", color = "white") +
  labs(title = "Histogram of Fireplaces", x = "Fireplaces", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations with n fireplaces. Most observations contain 0 fireplaces, with outliers of property(ies) containing 2 or more.

# Number of garages histogram.
ggplot(residential_points, aes(x = garage_num)) +
  geom_histogram(fill = "gray", color = "white") +
  labs(title = "Histogram of Garages", x = "Garages", y = "Count") +
  theme_minimal()

This histogram showcases the number of garages available per observation. The median is 0 garages per property, with a maximum of 5 garages per property.

# Number of stories histogram.
ggplot(residential_points, aes(x = story_num)) +
  geom_histogram(fill = "orange", color = "white") +
  labs(title = "Histogram of Stories", x = "Stories", y = "Count") +
  theme_minimal()

This histogram showcases the number of observations that contain n housing stories. The median is 2 stories per property, with a maximum of 5 stories.

# Age histogram.
ggplot(residential_points, aes(x = age)) +
  geom_histogram(fill = "black", color = "white") +
  labs(title = "Histogram of Age", x = "Age", y = "Count") +
  theme_minimal()

This histogram showcases the amount of observations with their calculated ages (years) based off year built. The median age is 100 years, and the maximum is 275 years. This was then transformed into a polynomial variable to account for the decrease in housing price as the home ages until a certain age, then the price rises again.

# Histogram for pct_poverty
ggplot(philly_acs_summary, aes(x = pct_poverty)) +
  geom_histogram(fill = "skyblue", color = "white") +
  labs(title = "Histogram of Percent Poverty", x = "Percent Poverty (%)", y = "Count") +
  theme_minimal()

This histogram presents the tracts with their determined percent poverty. The percentage is suspiciously high with the range from the minimum (0%) from the first quartile is 98 percent.We also have 32 tracts with no data.

# Histogram for pct_bach
ggplot(philly_acs_summary, aes(x = pct_bach)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "white") +
  labs(title = "Histogram of Percent Bachelor Degree Holders", x = "Percent Bachelor Degree Holders (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent of bachelor + degree holders per census tract. The median is 13.5 %, with most tracts falling between the minimum and third quartile. We also have 32 tracts with no data.

# Histogram for pct_vacant
ggplot(philly_acs_summary, aes(x = pct_vacant)) +
  geom_histogram(binwidth = 10, fill = "salmon", color = "white") +
  labs(title = "Histogram of Percent Vacant Units", x = "Percent Vacant (%)", y = "Count") +
  theme_minimal()

This histogram shows the distribution of percent vacant residences within the census tracts. The median percentage of vacant homes is 8 percent. We also have 44 tracts with no data.

# Histogram for pct_single_family
ggplot(philly_acs_summary, aes(x = pct_single_family)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "white") +
  labs(title = "Histogram of Percent Single-Family Homes ", x = "Percent Single-Family (%)", y = "Count") +
  theme_minimal()

This histogram represents the percent of single families per tract. The median for percent single families per tract is 66.58%. We also have 45 tracts with no data.

# Histogram for medhhinc
ggplot(philly_acs_summary, aes(x = medhhinc)) +
  geom_histogram(binwidth = 10000, fill = "purple", color = "white") +
  labs(title = "Histogram of Median Household Income", x = "Median Household Income ($)", y = "Count") +
  theme_minimal()

This histogram represents the median household income value per census tract. The median of the median household income value is $66,795. Slightly right-skewed. We also have 32 tracts with no data.

# Residential property correlation.
corr_matrix_df <- final_data %>%
  select(1:22) %>%
  st_drop_geometry()

corr_matrix <- round(cor(na.omit(corr_matrix_df)), 2)

corr_matrix_plot <- ggcorrplot(corr_matrix,
                               title = "Correlation Matrix",
                               hc.order = FALSE,
                               method = "square",
                               lab = TRUE,
                               lab_size = 3,
                               colors = c("#6D9EC1", "white", "#E46726"),
                               ggtheme = ggplot2::theme_gray,
                               insig = "blank"
                               )

corr_matrix_plot

#ggsave("correlation_matrix.png", width = 8, height = 8, dpi = 400)

5 strongest correlations to price:

  • square feet

  • bathrooms

  • median income

  • percent bachelor’s

  • fireplaces

Seems like bachelor degrees are highly correlated with median income, and should be excluded as a predictor.

Model Building

Model Building Progression

Check for multicollinearity:

# VIF check for multicollinearity.
vif_check <- lm(ln_sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + story_num  + garage_num + ln_septa_half_mi + ln_park_dist + ln_city_dist + basement_type + fuel_type, data = residential_points)

vif(vif_check)
                     GVIF Df GVIF^(1/(2*Df))
ln_square_feet   2.136707  1        1.461748
bath_num         1.898315  1        1.377793
ac_binary        1.327352  1        1.152108
fireplace_num    1.255307  1        1.120405
story_num        1.480789  1        1.216877
garage_num       2.278523  1        1.509478
ln_septa_half_mi 3.024715  1        1.739171
ln_park_dist     1.042159  1        1.020862
ln_city_dist     3.368539  1        1.835358
basement_type    3.101506 10        1.058226
fuel_type        1.049909  3        1.008150
# Build Model step by step
# First Model (structural features only)

first_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data)

summary(first_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2), data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-950008  -77477  -12481   56145 4968579 

Coefficients:
                                          Estimate    Std. Error t value
(Intercept)                          -1567082.0223    47057.0173 -33.302
ln_square_feet                         264449.0506     6759.6051  39.122
bath_num                                50964.4976     2845.2442  17.912
ac_binary                               93074.7875     3512.0930  26.501
fireplace_num                          140920.0673     4566.3704  30.860
story_num                               36355.3854     3359.3675  10.822
garage_num                              73558.1435     4141.4904  17.761
basement_typeFull Finished             -53294.3070    10213.5411  -5.218
basement_typeFull Semi-Finished        -64863.3637    11782.6135  -5.505
basement_typeFull Unfinished           -69506.5763    10618.4547  -6.546
basement_typeFull Unknown Finish       -86227.8236    10972.4377  -7.859
basement_typePartial Finished         -119436.7086    10829.3500 -11.029
basement_typePartial Semi-Finished    -129815.4807    11293.9584 -11.494
basement_typePartial Unfinished       -124535.0042    13701.2755  -9.089
basement_typePartial Unknown Finish   -133351.9021    13331.9618 -10.002
basement_typeUnknown Size Finished      64705.0675    49964.3245   1.295
basement_typeUnknown Size Unfinished   -24679.0227    38147.7561  -0.647
fuel_typeElectric                      -10161.5711    10207.1559  -0.996
fuel_typeOil Heat                       -1009.7000    20552.1456  -0.049
fuel_typeOther                         261076.4801    62623.7246   4.169
age                                     -4532.6270      146.2180 -30.999
I(age^2)                                   26.5926        0.8056  33.011
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                            < 0.0000000000000002 ***
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished            0.00000018345852182 ***
basement_typeFull Semi-Finished       0.00000003756885247 ***
basement_typeFull Unfinished          0.00000000006123926 ***
basement_typeFull Unknown Finish      0.00000000000000417 ***
basement_typePartial Finished        < 0.0000000000000002 ***
basement_typePartial Semi-Finished   < 0.0000000000000002 ***
basement_typePartial Unfinished      < 0.0000000000000002 ***
basement_typePartial Unknown Finish  < 0.0000000000000002 ***
basement_typeUnknown Size Finished                  0.195    
basement_typeUnknown Size Unfinished                0.518    
fuel_typeElectric                                   0.319    
fuel_typeOil Heat                                   0.961    
fuel_typeOther                        0.00003078296645867 ***
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 176400 on 13919 degrees of freedom
Multiple R-squared:   0.59, Adjusted R-squared:  0.5894 
F-statistic: 953.9 on 21 and 13919 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Second Model (structural features + census features)

second_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # Census feature.
                    data = final_data)

summary(second_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-989902  -65780   -4400   51407 5090214 

Coefficients:
                                          Estimate    Std. Error t value
(Intercept)                          -1624149.7913    43369.8473 -37.449
ln_square_feet                         247961.9374     6273.8891  39.523
bath_num                                57288.9970     2621.0982  21.857
ac_binary                               51722.4377     3347.5574  15.451
fireplace_num                          124755.3989     4241.4924  29.413
story_num                                8419.3065     3179.7289   2.648
garage_num                              78158.9012     3853.0850  20.285
basement_typeFull Finished             -11203.3193     9422.4862  -1.189
basement_typeFull Semi-Finished        -27934.8760    10857.5801  -2.573
basement_typeFull Unfinished           -26654.9384     9792.1690  -2.722
basement_typeFull Unknown Finish       -36033.9701    10129.4450  -3.557
basement_typePartial Finished          -81205.6882     9998.5760  -8.122
basement_typePartial Semi-Finished     -79681.9378    10451.4432  -7.624
basement_typePartial Unfinished        -73329.8274    12642.7079  -5.800
basement_typePartial Unknown Finish    -86122.0344    12300.9671  -7.001
basement_typeUnknown Size Finished     100101.7847    45920.2560   2.180
basement_typeUnknown Size Unfinished    19173.2066    35073.6508   0.547
fuel_typeElectric                        3696.9179     9402.8322   0.393
fuel_typeOil Heat                       -8604.7154    18890.0009  -0.456
fuel_typeOther                         149253.8140    57603.0021   2.591
age                                     -3254.5310      136.7939 -23.791
I(age^2)                                   20.1073        0.7547  26.644
medhhinc                                    2.6437        0.0550  48.071
pct_vacant                                372.3966      219.4913   1.697
pct_single_family                       -1834.8630      167.3421 -10.965
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.008111 ** 
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.234461    
basement_typeFull Semi-Finished                  0.010097 *  
basement_typeFull Unfinished                     0.006496 ** 
basement_typeFull Unknown Finish                 0.000376 ***
basement_typePartial Finished        0.000000000000000498 ***
basement_typePartial Semi-Finished   0.000000000000026179 ***
basement_typePartial Unfinished      0.000000006768784183 ***
basement_typePartial Unknown Finish  0.000000000002653313 ***
basement_typeUnknown Size Finished               0.029281 *  
basement_typeUnknown Size Unfinished             0.584624    
fuel_typeElectric                                0.694199    
fuel_typeOil Heat                                0.648744    
fuel_typeOther                                   0.009578 ** 
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                                       0.089788 .  
pct_single_family                    < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 162000 on 13916 degrees of freedom
Multiple R-squared:  0.654, Adjusted R-squared:  0.6534 
F-statistic:  1096 on 24 and 13916 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Third Model (structural features + census features + spatial features)

third_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial 
                    data = final_data)

summary(third_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, 
    data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-819394  -62536   -2258   51604 5097715 

Coefficients:
                                           Estimate     Std. Error t value
(Intercept)                          -1747446.62877    42113.83235 -41.493
ln_square_feet                         258006.93589     6017.84941  42.874
bath_num                                51042.12593     2519.99338  20.255
ac_binary                               45917.50180     3218.68122  14.266
fireplace_num                          128996.21070     4059.56014  31.776
story_num                               -3006.92663     3073.33854  -0.978
garage_num                              87539.16415     3697.25870  23.677
basement_typeFull Finished               9070.37214     9051.19859   1.002
basement_typeFull Semi-Finished          -337.29513    10425.15065  -0.032
basement_typeFull Unfinished            -6521.16554     9414.05817  -0.693
basement_typeFull Unknown Finish       -15321.03554     9734.25556  -1.574
basement_typePartial Finished          -37578.40579     9657.71640  -3.891
basement_typePartial Semi-Finished     -34239.59417    10135.57634  -3.378
basement_typePartial Unfinished        -43143.66974    12125.46538  -3.558
basement_typePartial Unknown Finish    -53104.80606    11802.51925  -4.499
basement_typeUnknown Size Finished     125461.12949    43919.14342   2.857
basement_typeUnknown Size Unfinished    37074.12283    33545.52725   1.105
fuel_typeElectric                        3186.00476     8992.43919   0.354
fuel_typeOil Heat                       -1438.39006    18066.82035  -0.080
fuel_typeOther                         104203.82523    55102.10396   1.891
age                                     -2598.74074      132.21683 -19.655
I(age^2)                                   14.75023        0.73757  19.998
medhhinc                                    2.10945        0.05596  37.697
pct_vacant                              -1179.63594      224.04991  -5.265
pct_single_family                         721.42744      182.54422   3.952
city_dist_mi                            -2543.46919      765.70754  -3.322
septa_half_mi                            1849.26010       70.19698  26.344
ln_park_dist                           -16918.37333     2088.82500  -8.099
knn_amenity_mi                         -11286.10624     8531.67917  -1.323
                                                 Pr(>|t|)    
(Intercept)                          < 0.0000000000000002 ***
ln_square_feet                       < 0.0000000000000002 ***
bath_num                             < 0.0000000000000002 ***
ac_binary                            < 0.0000000000000002 ***
fireplace_num                        < 0.0000000000000002 ***
story_num                                        0.327898    
garage_num                           < 0.0000000000000002 ***
basement_typeFull Finished                       0.316304    
basement_typeFull Semi-Finished                  0.974190    
basement_typeFull Unfinished                     0.488506    
basement_typeFull Unknown Finish                 0.115526    
basement_typePartial Finished                    0.000100 ***
basement_typePartial Semi-Finished               0.000732 ***
basement_typePartial Unfinished                  0.000375 ***
basement_typePartial Unknown Finish  0.000006868276779543 ***
basement_typeUnknown Size Finished               0.004288 ** 
basement_typeUnknown Size Unfinished             0.269097    
fuel_typeElectric                                0.723121    
fuel_typeOil Heat                                0.936545    
fuel_typeOther                                   0.058631 .  
age                                  < 0.0000000000000002 ***
I(age^2)                             < 0.0000000000000002 ***
medhhinc                             < 0.0000000000000002 ***
pct_vacant                           0.000000142230197359 ***
pct_single_family                    0.000077861150682202 ***
city_dist_mi                                     0.000897 ***
septa_half_mi                        < 0.0000000000000002 ***
ln_park_dist                         0.000000000000000598 ***
knn_amenity_mi                                   0.185908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 154900 on 13912 degrees of freedom
Multiple R-squared:  0.6838,    Adjusted R-squared:  0.6832 
F-statistic:  1075 on 28 and 13912 DF,  p-value: < 0.00000000000000022
# Build Model step by step
# Final Model 
## (structural features + census features + spatial features + interactions and fixed effects)

final_model <- lm(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family + # Census feature.
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial 
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe + quarters_fe, # Fixed effect 
                    data = final_data)

summary(final_model)

Call:
lm(formula = sale_price ~ ln_square_feet + bath_num + ac_binary + 
    fireplace_num + story_num + garage_num + basement_type + 
    fuel_type + age + I(age^2) + medhhinc + pct_vacant + pct_single_family + 
    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + 
    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + 
    neighborhood_fe + quarters_fe, data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-801160  -47098    2286   44401 5124299 

Coefficients:
                                                 Estimate     Std. Error
(Intercept)                                -1654284.18894    45392.97802
ln_square_feet                               261050.80097     5650.29292
bath_num                                      43921.37742     2303.14540
ac_binary                                     38011.00416     3010.73440
fireplace_num                                 89162.46056     3853.46377
story_num                                     -3873.48325     2927.37581
garage_num                                    77020.21464     3402.33588
basement_typeFull Finished                   101848.53373     8807.33464
basement_typeFull Semi-Finished               92723.21627     9959.03892
basement_typeFull Unfinished                  81600.09026     9109.11835
basement_typeFull Unknown Finish              72479.36671     9358.37403
basement_typePartial Finished                 49995.00240     9313.23663
basement_typePartial Semi-Finished            45965.04796     9819.45186
basement_typePartial Unfinished               38856.74602    11404.12187
basement_typePartial Unknown Finish           31165.32465    11136.63290
basement_typeUnknown Size Finished           167885.88331    39673.73261
basement_typeUnknown Size Unfinished         130311.96717    30345.51351
fuel_typeElectric                             -9754.85941     8147.22176
fuel_typeOil Heat                              3569.21538    16203.70043
fuel_typeOther                                 7347.87084    49551.16723
age                                           -1464.49032      129.99882
I(age^2)                                          5.71393        0.74312
medhhinc                                          0.62837        0.08434
pct_vacant                                     -996.31356      291.17129
pct_single_family                               233.09348      214.36879
city_dist_mi                                  -5092.98523     3213.77891
septa_half_mi                                  1374.67637      193.65365
ln_park_dist                                  -5492.24029     4829.20206
knn_amenity_mi                               -20795.25897    23954.87310
neighborhood_feAcademy Gardens                38531.73687    29020.43682
neighborhood_feAllegheny West                -79880.01869    19713.11304
neighborhood_feAndorra                       -36151.57128    30604.17613
neighborhood_feAston-Woodbridge               37863.26480    33157.16989
neighborhood_feBella Vista                    94984.74254    24491.99348
neighborhood_feBelmont                       -71179.93058    41145.93528
neighborhood_feBrewerytown                   -68683.45398    18815.02781
neighborhood_feBridesburg                      8830.69703    22135.13448
neighborhood_feBurholme                       -9166.50875    36104.12704
neighborhood_feBustleton                      26509.94550    24613.84632
neighborhood_feCarroll Park                  -78564.59820    20497.24375
neighborhood_feCedar Park                     48531.03520    21462.08158
neighborhood_feCedarbrook                    -17920.57303    23491.69295
neighborhood_feChestnut Hill                 406065.51580    20949.76540
neighborhood_feClearview                     -82457.90913    31970.65530
neighborhood_feCobbs Creek                   -76936.57377    16128.01800
neighborhood_feDickinson Narrows             -42561.54841    20633.48271
neighborhood_feDunlap                       -147554.74274    34818.07584
neighborhood_feEast Germantown               -53463.37458    22243.14292
neighborhood_feEast Kensington               -22113.94415    18584.19135
neighborhood_feEast Mount Airy               -38481.90384    18266.64290
neighborhood_feEast Oak Lane                -109101.54227    29156.25414
neighborhood_feEast Parkside                 -95781.92049    42973.73591
neighborhood_feEast Passyunk                  22394.58128    19728.66020
neighborhood_feEastwick                      -63029.68746    34386.30873
neighborhood_feElmwood                       -66256.72445    18064.61421
neighborhood_feFairhill                      -70169.07014    37385.71421
neighborhood_feFairmount                      84031.34160    18096.23139
neighborhood_feFeltonville                   -63005.42533    19807.93035
neighborhood_feFern Rock                     -78814.78831    34354.26616
neighborhood_feFishtown - Lower Kensington    12033.86510    15437.23432
neighborhood_feFitler Square                 437759.17782    30630.82608
neighborhood_feFox Chase                         39.75845    21393.39459
neighborhood_feFrancisville                  -59579.46728    23494.58410
neighborhood_feFrankford                     -64341.21360    18564.44342
neighborhood_feFranklinville                -133192.33904    29355.32653
neighborhood_feGermantown - Morton           -71897.43283    26470.62856
neighborhood_feGermantown - Penn Knox       -133998.47683    44315.50890
neighborhood_feGermantown - Westside         -95988.41377    39300.13662
neighborhood_feGermany Hill                   12489.77650    26080.16843
neighborhood_feGirard Estates                -27376.51073    18899.90688
neighborhood_feGlenwood                     -117054.57609    28206.88493
neighborhood_feGraduate Hospital             100111.85768    19917.23425
neighborhood_feGrays Ferry                   -73102.98484    17405.48499
neighborhood_feGreenwich                     -60969.08492    25194.75207
neighborhood_feHaddington                    -90097.32445    18361.68936
neighborhood_feHarrowgate                    -61260.91053    18942.76005
neighborhood_feHartranft                    -129737.46866    22605.75343
neighborhood_feHawthorne                      18630.24561    28800.25686
neighborhood_feHolmesburg                    -16100.82363    20509.45245
neighborhood_feHunting Park                  -50557.40906    22023.90711
neighborhood_feJuniata Park                  -49096.34401    17784.43309
neighborhood_feKingsessing                   -98980.57195    17174.17982
neighborhood_feLawndale                      -37692.71608    17636.65578
neighborhood_feLexington Park                 23524.13125    28487.46833
neighborhood_feLogan                        -100294.04172    19795.94945
neighborhood_feLogan Square                  385654.46595    30122.96764
neighborhood_feLower Moyamensing             -62443.95008    17456.32947
neighborhood_feManayunk                       31140.10822    18159.13838
neighborhood_feMantua                        -89984.78429    27382.69293
neighborhood_feMayfair                       -10322.70901    18287.98642
neighborhood_feMelrose Park Gardens          -66732.92057    31985.13877
neighborhood_feMill Creek                   -107673.65454    26150.79584
neighborhood_feMillbrook                      17202.84227    30504.35776
neighborhood_feModena                         56408.87055    28478.39817
neighborhood_feMorrell Park                   26579.70566    27888.03076
neighborhood_feNewbold                       -50933.70247    21767.37833
neighborhood_feNicetown                      -64789.39380    36265.05877
neighborhood_feNormandy Village              156296.14689    49077.70856
neighborhood_feNorth Central                 -99452.52392    22149.07480
neighborhood_feNorthern Liberties              5127.70680    19219.97164
neighborhood_feNorthwood                     -95831.11489    23307.47163
neighborhood_feOgontz                        -36947.20500    20096.61768
neighborhood_feOld City                      447625.23213    45162.09432
neighborhood_feOld Kensington                -44456.83078    20321.37495
neighborhood_feOlney                         -66547.81346    17113.10521
neighborhood_feOverbrook                     -92461.31622    16952.16723
neighborhood_feOxford Circle                 -21945.67532    17778.31303
neighborhood_fePacker Park                    12279.39865    25652.53959
neighborhood_feParkwood Manor                 52262.59747    29490.04577
neighborhood_fePaschall                      -75915.60507    19723.60401
neighborhood_fePassyunk Square                 8241.67380    20567.73016
neighborhood_fePennsport                     -29228.97667    19630.22906
neighborhood_fePennypack                        596.48399    23865.79987
neighborhood_fePennypack Woods                11158.96146    46682.77751
neighborhood_fePenrose                       -57105.63776    30773.25200
neighborhood_fePoint Breeze                  -59682.56695    17674.51307
neighborhood_feQueen Village                 112754.54516    21195.04932
neighborhood_feRhawnhurst                     13562.96366    21702.29629
neighborhood_feRichmond                      -27599.81183    15321.69715
neighborhood_feRittenhouse                   440406.48964    26289.78879
neighborhood_feRoxborough                     12251.02685    16098.01520
neighborhood_feRoxborough Park               -31517.21994    32778.30320
neighborhood_feSharswood                     -94162.29238    41240.70248
neighborhood_feSociety Hill                  356679.71839    26084.50166
neighborhood_feSomerton                       58144.46551    29038.97783
neighborhood_feSouthwest Germantown         -102902.21244    24678.27855
neighborhood_feSouthwest Schuylkill          -88512.12207    20947.20616
neighborhood_feSpring Garden                 184086.78776    25813.03750
neighborhood_feSpruce Hill                   136984.68804    30030.48304
neighborhood_feStadium District              -21490.74394    21952.64329
neighborhood_feStanton                      -148937.06300    20429.18329
neighborhood_feStrawberry Mansion           -105855.73497    20331.99581
neighborhood_feSummerdale                    -39338.42477    21646.43150
neighborhood_feTacony                        -36131.85212    19822.34575
neighborhood_feTioga                        -117410.89451    23321.96176
neighborhood_feTorresdale                    -22123.42518    27026.31718
neighborhood_feUpper Kensington              -94608.70984    18310.03157
neighborhood_feUpper Roxborough              -37275.44606    20171.34682
neighborhood_feWalnut Hill                   -55367.31014    30171.91092
neighborhood_feWashington Square West         96383.27339    29880.03962
neighborhood_feWest Central Germantown        18454.74064    27851.43559
neighborhood_feWest Kensington               -66033.62559    20282.48524
neighborhood_feWest Mount Airy                35678.92076    18353.74254
neighborhood_feWest Oak Lane                 -43938.98249    18217.28671
neighborhood_feWest Passyunk                 -76472.97091    19732.77514
neighborhood_feWest Poplar                  -106465.93596    43519.59747
neighborhood_feWest Powelton                 -95741.71043    33698.62704
neighborhood_feWhitman                       -40508.68137    18235.13117
neighborhood_feWinchester Park                39365.43842    34682.23181
neighborhood_feWissahickon                    -3320.97684    20167.55191
neighborhood_feWissahickon Hills              13849.88136    32347.39210
neighborhood_feWissinoming                   -19969.84133    18291.53504
neighborhood_feWister                        -66623.80465    30960.92716
neighborhood_feWynnefield                    -98506.70655    20091.53373
neighborhood_feSmall Neighborhoods           -24320.64344    19014.46524
quarters_fe2                                   3462.54510     3371.53444
quarters_fe3                                   4367.64570     3431.38250
quarters_fe4                                   5624.31315     3556.90478
city_dist_mi:knn_amenity_mi                    1746.43066     2637.10048
city_dist_mi:septa_half_mi                     -324.87922       41.36458
                                           t value             Pr(>|t|)    
(Intercept)                                -36.444 < 0.0000000000000002 ***
ln_square_feet                              46.201 < 0.0000000000000002 ***
bath_num                                    19.070 < 0.0000000000000002 ***
ac_binary                                   12.625 < 0.0000000000000002 ***
fireplace_num                               23.138 < 0.0000000000000002 ***
story_num                                   -1.323             0.185793    
garage_num                                  22.637 < 0.0000000000000002 ***
basement_typeFull Finished                  11.564 < 0.0000000000000002 ***
basement_typeFull Semi-Finished              9.310 < 0.0000000000000002 ***
basement_typeFull Unfinished                 8.958 < 0.0000000000000002 ***
basement_typeFull Unknown Finish             7.745  0.00000000000001023 ***
basement_typePartial Finished                5.368  0.00000008083011140 ***
basement_typePartial Semi-Finished           4.681  0.00000288170975129 ***
basement_typePartial Unfinished              3.407             0.000658 ***
basement_typePartial Unknown Finish          2.798             0.005142 ** 
basement_typeUnknown Size Finished           4.232  0.00002334701292127 ***
basement_typeUnknown Size Unfinished         4.294  0.00001764647882630 ***
fuel_typeElectric                           -1.197             0.231201    
fuel_typeOil Heat                            0.220             0.825663    
fuel_typeOther                               0.148             0.882117    
age                                        -11.265 < 0.0000000000000002 ***
I(age^2)                                     7.689  0.00000000000001581 ***
medhhinc                                     7.450  0.00000000000009886 ***
pct_vacant                                  -3.422             0.000624 ***
pct_single_family                            1.087             0.276902    
city_dist_mi                                -1.585             0.113050    
septa_half_mi                                7.099  0.00000000000132160 ***
ln_park_dist                                -1.137             0.255434    
knn_amenity_mi                              -0.868             0.385354    
neighborhood_feAcademy Gardens               1.328             0.184284    
neighborhood_feAllegheny West               -4.052  0.00005103271376971 ***
neighborhood_feAndorra                      -1.181             0.237519    
neighborhood_feAston-Woodbridge              1.142             0.253502    
neighborhood_feBella Vista                   3.878             0.000106 ***
neighborhood_feBelmont                      -1.730             0.083664 .  
neighborhood_feBrewerytown                  -3.650             0.000263 ***
neighborhood_feBridesburg                    0.399             0.689940    
neighborhood_feBurholme                     -0.254             0.799584    
neighborhood_feBustleton                     1.077             0.281484    
neighborhood_feCarroll Park                 -3.833             0.000127 ***
neighborhood_feCedar Park                    2.261             0.023760 *  
neighborhood_feCedarbrook                   -0.763             0.445568    
neighborhood_feChestnut Hill                19.383 < 0.0000000000000002 ***
neighborhood_feClearview                    -2.579             0.009914 ** 
neighborhood_feCobbs Creek                  -4.770  0.00000185774531396 ***
neighborhood_feDickinson Narrows            -2.063             0.039156 *  
neighborhood_feDunlap                       -4.238  0.00002271107888622 ***
neighborhood_feEast Germantown              -2.404             0.016248 *  
neighborhood_feEast Kensington              -1.190             0.234093    
neighborhood_feEast Mount Airy              -2.107             0.035164 *  
neighborhood_feEast Oak Lane                -3.742             0.000183 ***
neighborhood_feEast Parkside                -2.229             0.025840 *  
neighborhood_feEast Passyunk                 1.135             0.256341    
neighborhood_feEastwick                     -1.833             0.066826 .  
neighborhood_feElmwood                      -3.668             0.000246 ***
neighborhood_feFairhill                     -1.877             0.060554 .  
neighborhood_feFairmount                     4.644  0.00000345582237230 ***
neighborhood_feFeltonville                  -3.181             0.001472 ** 
neighborhood_feFern Rock                    -2.294             0.021795 *  
neighborhood_feFishtown - Lower Kensington   0.780             0.435678    
neighborhood_feFitler Square                14.291 < 0.0000000000000002 ***
neighborhood_feFox Chase                     0.002             0.998517    
neighborhood_feFrancisville                 -2.536             0.011227 *  
neighborhood_feFrankford                    -3.466             0.000530 ***
neighborhood_feFranklinville                -4.537  0.00000574751675728 ***
neighborhood_feGermantown - Morton          -2.716             0.006613 ** 
neighborhood_feGermantown - Penn Knox       -3.024             0.002501 ** 
neighborhood_feGermantown - Westside        -2.442             0.014601 *  
neighborhood_feGermany Hill                  0.479             0.632018    
neighborhood_feGirard Estates               -1.448             0.147500    
neighborhood_feGlenwood                     -4.150  0.00003346791368968 ***
neighborhood_feGraduate Hospital             5.026  0.00000050605976280 ***
neighborhood_feGrays Ferry                  -4.200  0.00002685966247264 ***
neighborhood_feGreenwich                    -2.420             0.015537 *  
neighborhood_feHaddington                   -4.907  0.00000093627417988 ***
neighborhood_feHarrowgate                   -3.234             0.001224 ** 
neighborhood_feHartranft                    -5.739  0.00000000971642614 ***
neighborhood_feHawthorne                     0.647             0.517722    
neighborhood_feHolmesburg                   -0.785             0.432441    
neighborhood_feHunting Park                 -2.296             0.021715 *  
neighborhood_feJuniata Park                 -2.761             0.005777 ** 
neighborhood_feKingsessing                  -5.763  0.00000000842319059 ***
neighborhood_feLawndale                     -2.137             0.032601 *  
neighborhood_feLexington Park                0.826             0.408948    
neighborhood_feLogan                        -5.066  0.00000041067148077 ***
neighborhood_feLogan Square                 12.803 < 0.0000000000000002 ***
neighborhood_feLower Moyamensing            -3.577             0.000349 ***
neighborhood_feManayunk                      1.715             0.086396 .  
neighborhood_feMantua                       -3.286             0.001018 ** 
neighborhood_feMayfair                      -0.564             0.572455    
neighborhood_feMelrose Park Gardens         -2.086             0.036963 *  
neighborhood_feMill Creek                   -4.117  0.00003853806950985 ***
neighborhood_feMillbrook                     0.564             0.572799    
neighborhood_feModena                        1.981             0.047638 *  
neighborhood_feMorrell Park                  0.953             0.340563    
neighborhood_feNewbold                      -2.340             0.019303 *  
neighborhood_feNicetown                     -1.787             0.074032 .  
neighborhood_feNormandy Village              3.185             0.001452 ** 
neighborhood_feNorth Central                -4.490  0.00000717529832217 ***
neighborhood_feNorthern Liberties            0.267             0.789634    
neighborhood_feNorthwood                    -4.112  0.00003951963056112 ***
neighborhood_feOgontz                       -1.838             0.066013 .  
neighborhood_feOld City                      9.912 < 0.0000000000000002 ***
neighborhood_feOld Kensington               -2.188             0.028709 *  
neighborhood_feOlney                        -3.889             0.000101 ***
neighborhood_feOverbrook                    -5.454  0.00000005002864661 ***
neighborhood_feOxford Circle                -1.234             0.217072    
neighborhood_fePacker Park                   0.479             0.632173    
neighborhood_feParkwood Manor                1.772             0.076382 .  
neighborhood_fePaschall                     -3.849             0.000119 ***
neighborhood_fePassyunk Square               0.401             0.688641    
neighborhood_fePennsport                    -1.489             0.136516    
neighborhood_fePennypack                     0.025             0.980061    
neighborhood_fePennypack Woods               0.239             0.811080    
neighborhood_fePenrose                      -1.856             0.063519 .  
neighborhood_fePoint Breeze                 -3.377             0.000735 ***
neighborhood_feQueen Village                 5.320  0.00000010547567391 ***
neighborhood_feRhawnhurst                    0.625             0.532011    
neighborhood_feRichmond                     -1.801             0.071669 .  
neighborhood_feRittenhouse                  16.752 < 0.0000000000000002 ***
neighborhood_feRoxborough                    0.761             0.446654    
neighborhood_feRoxborough Park              -0.962             0.336304    
neighborhood_feSharswood                    -2.283             0.022432 *  
neighborhood_feSociety Hill                 13.674 < 0.0000000000000002 ***
neighborhood_feSomerton                      2.002             0.045273 *  
neighborhood_feSouthwest Germantown         -4.170  0.00003068011345479 ***
neighborhood_feSouthwest Schuylkill         -4.225  0.00002399598388643 ***
neighborhood_feSpring Garden                 7.132  0.00000000000104196 ***
neighborhood_feSpruce Hill                   4.562  0.00000512223155475 ***
neighborhood_feStadium District             -0.979             0.327617    
neighborhood_feStanton                      -7.290  0.00000000000032585 ***
neighborhood_feStrawberry Mansion           -5.206  0.00000019534831687 ***
neighborhood_feSummerdale                   -1.817             0.069190 .  
neighborhood_feTacony                       -1.823             0.068358 .  
neighborhood_feTioga                        -5.034  0.00000048552769706 ***
neighborhood_feTorresdale                   -0.819             0.413036    
neighborhood_feUpper Kensington             -5.167  0.00000024114846505 ***
neighborhood_feUpper Roxborough             -1.848             0.064632 .  
neighborhood_feWalnut Hill                  -1.835             0.066518 .  
neighborhood_feWashington Square West        3.226             0.001260 ** 
neighborhood_feWest Central Germantown       0.663             0.507589    
neighborhood_feWest Kensington              -3.256             0.001134 ** 
neighborhood_feWest Mount Airy               1.944             0.051921 .  
neighborhood_feWest Oak Lane                -2.412             0.015881 *  
neighborhood_feWest Passyunk                -3.875             0.000107 ***
neighborhood_feWest Poplar                  -2.446             0.014442 *  
neighborhood_feWest Powelton                -2.841             0.004502 ** 
neighborhood_feWhitman                      -2.221             0.026336 *  
neighborhood_feWinchester Park               1.135             0.256382    
neighborhood_feWissahickon                  -0.165             0.869207    
neighborhood_feWissahickon Hills             0.428             0.668541    
neighborhood_feWissinoming                  -1.092             0.274961    
neighborhood_feWister                       -2.152             0.031425 *  
neighborhood_feWynnefield                   -4.903  0.00000095509695883 ***
neighborhood_feSmall Neighborhoods          -1.279             0.200897    
quarters_fe2                                 1.027             0.304441    
quarters_fe3                                 1.273             0.203092    
quarters_fe4                                 1.581             0.113847    
city_dist_mi:knn_amenity_mi                  0.662             0.507819    
city_dist_mi:septa_half_mi                  -7.854  0.00000000000000432 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138200 on 13781 degrees of freedom
Multiple R-squared:  0.7506,    Adjusted R-squared:  0.7477 
F-statistic: 260.8 on 159 and 13781 DF,  p-value: < 0.00000000000000022
# Step model to check for most influential predictors.
step_model <- step(final_model, direction = "both")
Start:  AIC=330190.3
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + fuel_type + age + 
    I(age^2) + medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + knn_amenity_mi * 
    city_dist_mi + septa_half_mi * city_dist_mi + neighborhood_fe + 
    quarters_fe

                               Df      Sum of Sq             RSS    AIC
- fuel_type                     3    28971983067 263370680744177 330186
- quarters_fe                   3    52442918807 263394151679917 330187
- city_dist_mi:knn_amenity_mi   1     8380853590 263350089614701 330189
- pct_single_family             1    22593110340 263364301871450 330190
- ln_park_dist                  1    24716514199 263366425275309 330190
- story_num                     1    33456870852 263375165631963 330190
<none>                                           263341708761111 330190
- pct_vacant                    1   223734984996 263565443746107 330200
- medhhinc                      1  1060609290627 264402318051737 330244
- I(age^2)                      1  1129786160856 264471494921967 330248
- city_dist_mi:septa_half_mi    1  1178760341776 264520469102886 330251
- age                           1  2425119284452 265766828045563 330316
- ac_binary                     1  3045879467431 266387588228541 330349
- basement_type                10  5161689474999 268503398236109 330441
- bath_num                      1  6949412767494 270291121528605 330551
- garage_num                    1  9792509392556 273134218153667 330697
- fireplace_num                 1 10230584502254 273572293263365 330720
- ln_square_feet                1 40789375074412 304131083835522 332196
- neighborhood_fe             126 55181544433718 318523253194828 332591

Step:  AIC=330185.9
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    quarters_fe + city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- quarters_fe                   3    51438746552 263422119490729 330183
- city_dist_mi:knn_amenity_mi   1     8631162534 263379311906712 330184
- pct_single_family             1    22356315108 263393037059286 330185
- ln_park_dist                  1    25102946515 263395783690693 330185
- story_num                     1    33096011182 263403776755359 330186
<none>                                           263370680744177 330186
+ fuel_type                     3    28971983067 263341708761111 330190
- pct_vacant                    1   228032586045 263598713330222 330196
- medhhinc                      1  1066157006124 264436837750302 330240
- I(age^2)                      1  1128629640995 264499310385172 330243
- city_dist_mi:septa_half_mi    1  1182720308258 264553401052436 330246
- age                           1  2419855498601 265790536242778 330311
- ac_binary                     1  3110039298800 266480720042978 330348
- basement_type                10  5187015536566 268557696280743 330438
- bath_num                      1  6930786670163 270301467414341 330546
- garage_num                    1  9815247503810 273185928247988 330694
- fireplace_num                 1 10269983164027 273640663908205 330717
- ln_square_feet                1 41005558526594 304376239270772 332201
- neighborhood_fe             126 55221519394647 318592200138824 332588

Step:  AIC=330182.6
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:knn_amenity_mi + city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- city_dist_mi:knn_amenity_mi   1     8606948099 263430726438828 330181
- pct_single_family             1    22818488137 263444937978866 330182
- ln_park_dist                  1    24835965877 263446955456606 330182
- story_num                     1    33262241085 263455381731814 330182
<none>                                           263422119490729 330183
+ quarters_fe                   3    51438746552 263370680744177 330186
+ fuel_type                     3    27967810812 263394151679917 330187
- pct_vacant                    1   229602625324 263651722116053 330193
- medhhinc                      1  1068672414764 264490791905494 330237
- I(age^2)                      1  1129514641993 264551634132722 330240
- city_dist_mi:septa_half_mi    1  1187476136232 264609595626962 330243
- age                           1  2423485456386 265845604947115 330308
- ac_binary                     1  3111930285728 266534049776457 330344
- basement_type                10  5170623553542 268592743044271 330434
- bath_num                      1  6929313581527 270351433072256 330543
- garage_num                    1  9799535331347 273221654822076 330690
- fireplace_num                 1 10270559696819 273692679187548 330714
- ln_square_feet                1 41001940725351 304424060216080 332197
- neighborhood_fe             126 55232719870485 318654839361214 332584

Step:  AIC=330181
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + knn_amenity_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                               Df      Sum of Sq             RSS    AIC
- knn_amenity_mi                1     8275874573 263439002313401 330179
- pct_single_family             1    22276767045 263453003205873 330180
- ln_park_dist                  1    24658679744 263455385118572 330180
- story_num                     1    31746414274 263462472853103 330181
<none>                                           263430726438828 330181
+ city_dist_mi:knn_amenity_mi   1     8606948099 263422119490729 330183
+ quarters_fe                   3    51414532116 263379311906712 330184
+ fuel_type                     3    28216037828 263402510401001 330186
- pct_vacant                    1   225408632658 263656135071486 330191
- medhhinc                      1  1123021598624 264553748037452 330238
- I(age^2)                      1  1131258749841 264561985188669 330239
- city_dist_mi:septa_half_mi    1  1375443786743 264806170225571 330252
- age                           1  2432052513712 265862778952540 330307
- ac_binary                     1  3115169114995 266545895553823 330343
- basement_type                10  5169809471535 268600535910363 330432
- bath_num                      1  6921443930483 270352170369311 330541
- garage_num                    1  9791956949448 273222683388276 330688
- fireplace_num                 1 10262474849952 273693201288780 330712
- ln_square_feet                1 41016117393234 304446843832062 332196
- neighborhood_fe             126 55306033620702 318736760059530 332586

Step:  AIC=330179.5
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + pct_single_family + city_dist_mi + 
    septa_half_mi + ln_park_dist + neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- pct_single_family            1    20371570192 263459373883594 330179
- ln_park_dist                 1    26289516653 263465291830055 330179
- story_num                    1    31323480363 263470325793764 330179
<none>                                          263439002313401 330179
+ knn_amenity_mi               1     8275874573 263430726438828 330181
+ quarters_fe                  3    51784130284 263387218183117 330183
+ fuel_type                    3    28284147736 263410718165666 330184
- pct_vacant                   1   224625913876 263663628227278 330189
- I(age^2)                     1  1129724742712 264568727056113 330237
- medhhinc                     1  1151743728564 264590746041965 330238
- city_dist_mi:septa_half_mi   1  1375812286411 264814814599812 330250
- age                          1  2430660522140 265869662835542 330306
- ac_binary                    1  3113192846563 266552195159964 330341
- basement_type               10  5163488384522 268602490697923 330430
- bath_num                     1  6921530951663 270360533265064 330539
- garage_num                   1  9784284772736 273223287086137 330686
- fireplace_num                1 10254426787066 273693429100468 330710
- ln_square_feet               1 41028357650651 304467359964053 332195
- neighborhood_fe            126 55612146862440 319051149175841 332598

Step:  AIC=330178.5
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + city_dist_mi + septa_half_mi + ln_park_dist + 
    neighborhood_fe + city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- ln_park_dist                 1    28008167926 263487382051520 330178
- story_num                    1    33377928201 263492751811794 330178
<none>                                          263459373883594 330179
+ pct_single_family            1    20371570192 263439002313401 330179
+ knn_amenity_mi               1     6370677720 263453003205873 330180
+ quarters_fe                  3    52178014448 263407195869145 330182
+ fuel_type                    3    28015685197 263431358198397 330183
- pct_vacant                   1   222024320821 263681398204415 330188
- I(age^2)                     1  1131582978391 264590956861985 330236
- medhhinc                     1  1212709882326 264672083765920 330241
- city_dist_mi:septa_half_mi   1  1413564197209 264872938080802 330251
- age                          1  2430443389801 265889817273395 330305
- ac_binary                    1  3123603493943 266582977377537 330341
- basement_type               10  5230492366853 268689866250446 330433
- bath_num                     1  6933854635846 270393228519440 330539
- garage_num                   1  9910082534112 273369456417706 330691
- fireplace_num                1 10342674149940 273802048033533 330713
- ln_square_feet               1 41223255146550 304682629030144 332203
- neighborhood_fe            126 55592087203142 319051461086735 332596

Step:  AIC=330178
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    story_num + garage_num + basement_type + age + I(age^2) + 
    medhhinc + pct_vacant + city_dist_mi + septa_half_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
- story_num                    1    34567303780 263521949355299 330178
<none>                                          263487382051520 330178
+ ln_park_dist                 1    28008167926 263459373883594 330179
+ pct_single_family            1    22090221465 263465291830055 330179
+ knn_amenity_mi               1     7780794800 263479601256720 330180
+ quarters_fe                  3    51944536701 263435437514819 330181
+ fuel_type                    3    28413376447 263458968675072 330183
- pct_vacant                   1   231591919878 263718973971398 330188
- I(age^2)                     1  1128240711862 264615622763381 330236
- medhhinc                     1  1206516971008 264693899022527 330240
- city_dist_mi:septa_half_mi   1  1465170936441 264952552987960 330253
- age                          1  2423721998751 265911104050271 330304
- ac_binary                    1  3104119094203 266591501145723 330339
- basement_type               10  5226752829971 268714134881491 330432
- bath_num                     1  6931912763429 270419294814949 330538
- garage_num                   1  9899865999516 273387248051035 330690
- fireplace_num                1 10319948107167 273807330158687 330712
- ln_square_feet               1 41278297289578 304765679341097 332205
- neighborhood_fe            126 56845145187566 320332527239085 332649

Step:  AIC=330177.9
sale_price ~ ln_square_feet + bath_num + ac_binary + fireplace_num + 
    garage_num + basement_type + age + I(age^2) + medhhinc + 
    pct_vacant + city_dist_mi + septa_half_mi + neighborhood_fe + 
    city_dist_mi:septa_half_mi

                              Df      Sum of Sq             RSS    AIC
<none>                                          263521949355299 330178
+ story_num                    1    34567303780 263487382051520 330178
+ ln_park_dist                 1    29197543505 263492751811795 330178
+ pct_single_family            1    24302604362 263497646750937 330179
+ knn_amenity_mi               1     7287627295 263514661728004 330179
+ quarters_fe                  3    52106730347 263469842624952 330181
+ fuel_type                    3    27994771870 263493954583429 330182
- pct_vacant                   1   229126001253 263751075356553 330188
- I(age^2)                     1  1109829719410 264631779074709 330234
- medhhinc                     1  1201108168370 264723057523669 330239
- city_dist_mi:septa_half_mi   1  1487964999450 265009914354749 330254
- age                          1  2418574306245 265940523661544 330303
- ac_binary                    1  3094917492824 266616866848123 330339
- basement_type               10  5193569890594 268715519245894 330430
- bath_num                     1  6910999322629 270432948677929 330537
- garage_num                   1 10115246601282 273637195956582 330701
- fireplace_num                1 10360467578423 273882416933722 330713
- ln_square_feet               1 46797619123932 310319568479232 332455
- neighborhood_fe            126 56811652401070 320333601756370 332648
# Compute TSS
y <- model.response(model.frame(step_model))
tss <- sum((y - mean(y))^2)

# Sequential (Type I) Sum Sq → ΔR²
anova_model <- anova(step_model)
anova_model <- anova_model[!is.na(anova_model$"Sum Sq"), , drop = FALSE]
seq_imp <- transform(anova_model,
                     term = rownames(anova_model),
                     delta_R2 = `Sum Sq` / tss)

# Get top 5 predictors
seq_top4 <- seq_imp[order(-seq_imp$delta_R2), c("term", "Df", "delta_R2")]
head(seq_top4, 5)
                           term    Df   delta_R2
ln_square_feet   ln_square_feet     1 0.40671448
Residuals             Residuals 13792 0.24958298
neighborhood_fe neighborhood_fe   126 0.06706042
medhhinc               medhhinc     1 0.06389808
bath_num               bath_num     1 0.05057187
# Residual map preparation

# Match CRS
philly_neighborhoods <- st_transform(philly_neighborhoods, st_crs(final_data))

# Spatial join: assign each point to a neighborhood
points_with_neighborhood <- st_join(final_data, philly_neighborhoods)

# Add residual column
points_with_neighborhood$residuals <- residuals(final_model)

# Calculate average residual by neighborhood
neighborhood_residuals <- points_with_neighborhood %>%
  st_drop_geometry() %>%
  group_by(MAPNAME) %>%
  summarise(
    mean_residual = mean(residuals, na.rm = TRUE),
    median_residual = median(residuals, na.rm = TRUE),
    n_sales = n()
  )

# Join to neighborhoods
neighborhoods_with_residuals <- philly_neighborhoods %>%
  left_join(neighborhood_residuals, by = "MAPNAME")
# Map the averaged residuals

neighborhood_map <- ggplot(neighborhoods_with_residuals) +
  geom_sf(aes(fill = mean_residual), color = "black", size = 0.3) +
  scale_fill_gradient2(
    low = "blue2", 
    mid = "#f5f4f0", 
    high = "red2",
    midpoint = 0,
    labels = scales::dollar,
    name = "Mean Residual ($)"
  ) +
  theme_minimal() +
  labs(
    title = "Average Model Residuals by Neighborhood",
    subtitle = "Red = Under-Predicted | Blue = Over-Predicted"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

neighborhood_map

#ggsave("slide_images/residual-neighborhood.png", plot = neighborhood_map, width = 8, height = 6, units = "in", dpi = 300)

University City is the hardest to predict, Penn owns a lot of property that doesn’t get taxed. And some less wealthy and disinvested neighborhoods are overvalued, like Parkside and Wynnefield in West Philadelphia, but the overall model predicts pretty accurately for most neighborhoods in Philadelphia.

Cross-Validation

10-Fold Cross-Validation

library(caret)
# 10-fold cross-validation
# first model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_1 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2), # Age polynomial.
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_1$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 175546.6 0.5950707 102835.8 25123.06 0.04133805 5470.833
library(caret)
# 10-fold cross-validation
# second model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_2 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family, # census
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_2$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 160991.5 0.6602203 88864.19 26143.65 0.04917685 4185.943
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_3 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi, # Spatial
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_3$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 153602.5 0.6908723 84966.83 27408.26 0.05515752 4235.502
library(caret)
# 10-fold cross-validation
# third model
set.seed(0)

train_control <- trainControl(method = "cv", number = 10, savePredictions="final")

cv_model_4 <- train(sale_price ~ ln_square_feet + bath_num + # Structural.
                    ac_binary + fireplace_num + story_num + garage_num + # Structural.
                    basement_type + fuel_type + # Categorical structural.
                    age + I(age^2) + # Age polynomial.
                    medhhinc + pct_vacant + pct_single_family +
                    city_dist_mi + septa_half_mi + ln_park_dist + knn_amenity_mi + # Spatial
                    knn_amenity_mi * city_dist_mi + septa_half_mi * city_dist_mi + # Interaction between amenities and downtown distance.
                    neighborhood_fe, # Fixed effect  
                    data = final_data,
                    method = "lm",
                    trControl = train_control)

cv_model_4$results
  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 137477.2 0.7530079 72542.97 27858.68 0.05796299 4022.137
# Compare all 4 models

## Combine four models:cv_model_1, cv_model_2, cv_model_3, cv_model_4
model_compare <- bind_rows(
  cv_model_1$results %>% 
    mutate(Model = "Model 1: Structural"),
  cv_model_2$results %>% 
    mutate(Model = "Model 2: Structural + Census"),
  cv_model_3$results %>% 
    mutate(Model = "Model 3: Structural + Census + Spatial"),
  cv_model_4$results %>% 
    mutate(Model = "Model 4: Final Model")
) %>%
  select(Model, RMSE, Rsquared, MAE) %>% 
  mutate(across(c(RMSE, Rsquared, MAE), round, 3))

## Plot
model_kable <- kable(model_compare,
                     col.names = c("Model", "RMSE ($)", "R²", "MAE ($)"),
                     caption = "Model Performance Comparison (10-Fold Cross-Validation)",
                     digits = 4,
                     format.args = list(big.mark = ",")
) %>%
  kable_styling(latex_options = "striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE, width = "10cm") %>%
  row_spec(0, color = "#f5f4f0", background = "#ff4100")

model_kable
Model Performance Comparison (10-Fold Cross-Validation)
Model RMSE ($) MAE ($)
Model 1: Structural 175,546.6 0.595 102,835.78
Model 2: Structural + Census 160,991.5 0.660 88,864.19
Model 3: Structural + Census + Spatial 153,602.5 0.691 84,966.83
Model 4: Final Model 137,477.2 0.753 72,542.97
# Create predicted vs. actual plot

pred_df <- cv_model_4$pred

# Plot Predicted vs Actual 
pred_v_act <- ggplot(pred_df, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Predicted vs. Actual Sale Price",
    x = "Actual Sale Price ($)",
    y = "Predicted Sale Price ($)"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

pred_v_act

Model Diagnostics

Check Assumptions

resid_df <- data.frame(
  fitted = fitted(final_model),
  residuals = resid(final_model)
)

resid_v_fit <- ggplot(resid_df, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#2C7BB6") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Fitted Values",
    x = "Fitted Values (Predicted Sale Price)",
    y = "Residuals"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

resid_v_fit

Residuals vs Fitted Values:

The residual plot shows that most residuals are centered around zero, but the spread of residuals increases as the fitted values grow. This “funnel-shaped” pattern suggests potential heteroskedasticity, meaning the variance of errors may increase for higher-priced properties. While the overall linearity assumption appears reasonable, the increasing dispersion indicates that the model’s prediction error is not constant across the price range.

resid_df <- data.frame(residuals = residuals(final_model))

# Q-Q Plot
qq_plot <- ggplot(resid_df, aes(sample = residuals)) +
  stat_qq(color = "#2C7BB6", alpha = 0.6, size = 2) +      
  stat_qq_line(color = "red", linetype = "dashed") +        
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Check for normality assumption.",
    x = "Theoretical Quantiles (Normal Distribution)",
    y = "Sample Quantiles (Residuals)"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

qq_plot

Normal Q–Q Plot:

The Q–Q plot reveals that the residuals deviate from the reference line in both tails, especially in the upper tail. This pattern indicates that the residuals are right-skewed and not perfectly normally distributed. The deviation is mainly driven by a small number of very high sale-price observations, which pull the residual distribution upward. However, moderate departures from normality are common in housing price data and generally do not invalidate the model.

cd <- cooks.distance(final_model)

used_row_idx <- as.integer(rownames(model.frame(final_model)))

cooks_df <- tibble(
  row_in_data  = used_row_idx,           
  row_in_model = seq_along(cd),          
  cooks_distance = as.numeric(cd)
)

n_used <- length(cd)
threshold <- 4 / n_used

cooks_plot <- ggplot(cooks_df, aes(x = row_in_model, y = cooks_distance)) +
  geom_bar(stat = "identity", fill = "#2C7BB6", alpha = 0.6) +
  geom_hline(yintercept = threshold, color = "red", linetype = "dashed") +
  coord_cartesian(ylim = c(0, 0.02)) +  
  labs(
    title = "Cook's Distance (Zoomed In)",
    subtitle = paste0("Red Dashed Line = 4/n ≈ ", round(threshold, 5)),
    x = "Observation Index (In-Model)",
    y = "Cook's Distance"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

cooks_plot

# Most influential
top_influential <- cooks_df %>%
  filter(cooks_distance > threshold) %>%
  arrange(desc(cooks_distance)) %>%
  slice_head(n = 10) %>%
  mutate(
    sale_price = final_data$sale_price[row_in_data]   
  ) %>%
  select(row_in_model, row_in_data, sale_price, cooks_distance)

top_influential
# A tibble: 10 × 4
   row_in_model row_in_data sale_price cooks_distance
          <int>       <int>      <dbl>          <dbl>
 1         6227        6227    3600000         0.197 
 2         6516        6516    5477901         0.129 
 3          254         254    3330400         0.0459
 4         8059        8060     601004         0.0448
 5         2569        2569    3995000         0.0391
 6        11502       11503     281000         0.0355
 7         7736        7737     330000         0.0324
 8         2669        2669    3000000         0.0283
 9         2003        2003     170000         0.0242
10         5537        5537    3850000         0.0230

Cook’s Distance:

The Cook’s distance plot shows that almost all observations have very small influence values (below the 4/n threshold), indicating that the model is not dominated by a few extreme points. A few cases exhibit slightly higher Cook’s D values, suggesting the presence of mildly influential outliers, but none appear to exert excessive leverage on the regression coefficients.

Detailed Discussion

Our final model achieves a cross-validated R² of 0.746, explaining approximately 75% of the variance in Philadelphia residential sale prices. The Mean Absolute Error (MAE) of 72,299 USD indicates that, on average, predicted sale prices deviate from actual prices by roughly 29% of the median home price (250,000 USD). However, the Root Mean Squared Error (RMSE) of 138,257USD—nearly double the MAE—reveals that the model struggles disproportionately with high-value properties. This discrepancy, combined with residuals ranging from -817,701 USD to +5.1 million USD, reflects the outsized influence of luxury homes on overall error. Diagnostic plots confirm these patterns: the Q-Q plot shows deviation from normality in both tails (especially the upper tail), while the residuals vs. fitted values plot exhibits a funnel-shaped pattern indicating heteroskedasticity—prediction error variance increases systematically for higher-priced properties. Despite these issues, the median residual of 2,339 USD (near zero) suggests the model’s predictions are generally unbiased for typical homes, and Cook’s distance values remain well below concerning thresholds, indicating no single observation dominates the model.

The model’s minimum and maximum residuals range from -817,701 USD to +5.1 million USD, reflecting the outlier influence from luxury properties. While simultaneously referencing the QQ Plot, the model reflects two tails in the plot, but more in the positive region, indicating the model is not 100% normal. However, the residual distribution with the median residual of 2,339 USD is fairly close to 0, meaning the model predictions are generally centered about sale prices with limited bias in sale prediction. It is imperative to keep in mind that in the residuals vs fitted values plot, the increase of observations fanning-out as the predicted sale price value increases indicates the existence of heteroskedasticity between some variables.

We have concluded that ln_square_feet, neighborhood fixed effects, median household income, number of bathrooms and number of fireplaces as the most significant variables in our model. We calculated this by doing a stepwise regression, and these five yield the highest \(R^2\).

We grossly underpredicted for University City, and a decent amount for Fairmount. This model struggles to accurately predict rural areas and overpredicted uninvested neighborhoods. The severe underprediction of University City may be due to the existence and proximity to Drexel University and the University of Pennsylvania.

External Resources

Relevant Readings

Artificial Intelligence

  • Claude: For code debugging in data cleaning and visualizations.

  • Chat GPT: For code debugging in data cleaning and visualizations.